GENERATE DIALOGUE BASED ON VERIFICATION SCORES
Patent Abstract:
"generate dialog based on verification scores". An exemplary dialog generating apparatus includes an audio receiver for receiving audio data including speech. The handset also includes a check score generator to generate a check score based on audio data. The apparatus further includes a user detector for detecting that the verification score exceeds a lower threshold but does not exceed an upper threshold. The apparatus includes a dialog generator for generating a dialog for requesting additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold but does not exceed an upper threshold. 公开号:BR102018070673A2 申请号:R102018070673-0 申请日:2018-10-08 公开日:2019-06-04 发明作者:Jonathan Huang;David Pearce;Willem M. Beltman 申请人:Intel Corporation; IPC主号:
Patent Description:
Descriptive Report of the Invention Patent for GENERATE DIALOGUE BASED ON VERIFICATION SCORES (Petition 870180138577, of 10/08/2018).

Background

[0001] Natural voice interfaces can use automatic speech recognition (ASR) and natural language processing (NLP) to receive spoken commands from users and perform actions in response. For example, ASR can be used to convert spoken commands into a machine-readable format. NLP can then be used to translate the machine-readable commands into one or more actions.

Brief Description of the Drawings

[0002] Figure 1 is a block diagram illustrating an example of a processing pipeline for generating a speaker verification score;

[0003] Figure 2 is a detailed flow chart illustrating an example of a process for generating dialogue based on a speaker verification score;

[0004] Figure 3 is a block diagram illustrating an example of generating a speaker verification score from example audio data received from a speaker;

[0005] Figure 4 is a graph illustrating an example of a detection error trade-off;

[0006] Figure 5 is a flowchart illustrating a method for generating a dialogue based on a speaker verification score;

[0007] Figure 6 is a block diagram illustrating an example of a computer device that can generate a dialogue based on a speaker verification score; and

[0008] Figure 7 is a block diagram showing computer-readable media that store code to generate a dialogue based on a speaker verification score.

[0009] Throughout the disclosure and figures, the same numbers are used to indicate equal components and resources. Numbers in the 100 series refer to resources originally found in figure 1; numbers in the 200 series refer to resources originally found in figure 2; and so on.

Description of Embodiments

[0010] As discussed above, natural voice interfaces can be used to perform one or more services in response to the reception of spoken commands.
For example, a natural voice interface can receive a spoken command and perform one or more tasks in response. However, some natural voice interfaces may not be able to recognize who is speaking. In addition, even if some natural voice systems include the ability to recognize who is speaking, those systems may have to make a decision on the speaker's identity based on a single initial utterance. Making decisions based on only one input can lead to errors in which a user is rejected or incorrectly identified as another person, causing frustration for the user.

[0011] This disclosure refers, in general, to techniques for generating a dialogue automatically. Specifically, the techniques described here include an apparatus, a method, and a system for generating a dialogue based on a calculated verification score. In particular, the techniques described here can be used to determine when to generate additional dialogue in order to improve a system's confidence in a speaker's voice verification score. An example apparatus includes an audio receiver for receiving audio data including speech. The apparatus may include a key phrase detector for detecting a key phrase in the audio data. The apparatus also includes a verification score generator to generate a verification score based on the audio data. The apparatus also includes a user detector to detect that the verification score exceeds a lower threshold but does not exceed an upper threshold. The apparatus also includes a dialogue generator to generate a dialogue requesting additional audio data to be used to generate an updated verification score, in response to detecting that the verification score exceeds the lower threshold but does not exceed the upper threshold.
[0012] The techniques described here thus allow the dialogue flow to be adjusted when there is uncertainty in the speaker verification scores, or when measurements of input signal quality indicate that speaker recognition performance will be degraded by environmental conditions. For example, an audio sample of user speech may be of poor quality due to background noise, or the audio sample may be too short to yield a high verification score. In addition, with speaker recognition capabilities, the techniques provide the ability to intelligently manage user profiles, make user-specific content recommendations, and allow access to certain restricted tasks, such as controlling devices or placing orders. The techniques described thus provide several improvements that enable a better user experience when using speaker recognition.

[0013] Figure 1 is a block diagram illustrating an example of a processing pipeline for generating a speaker verification score. The example system is indicated by reference number 100 and can be implemented in the computer device 600 of figure 6 below using method 500 of figure 5 below.

[0014] The example system 100 includes a speech receiver 102 communicatively coupled to a preprocessor 104. System 100 also includes a feature extractor 106 communicatively coupled to preprocessor 104. System 100 also includes a classifier 108 communicatively coupled to feature extractor 106. System 100 includes a speaker model 110 communicatively coupled to classifier 108. Classifier 108 is shown outputting a speaker identification score 112.

[0015] As shown in figure 1, system 100 can receive audio data including speech and output a speaker identification score 112. For example, the speaker identification score 112 may indicate the likelihood that a speech segment was uttered by a specific registered speaker.

[0016] Speech receiver 102 can receive audio data including speech.
For example, the audio data can include a key phrase and a command. The speech length in the audio data can range from a few seconds to a few minutes.

[0017] In some examples, the first step in the processing pipeline can be preprocessing the signal via preprocessor 104 to improve speech quality. For example, using an array of microphones, a beamformer can be used to maximize the signal-to-noise ratio (SNR) of speech by exploiting the different directionality of speech and noise. In some examples, de-reverberation compensating for the impulse response of the acoustic space may be applied. In some examples, other speech enhancement techniques may also be employed, such as spectral subtraction, Wiener filtering, and blind source separation.

[0018] The feature extractor 106 can receive the preprocessed audio data and process it to extract features. For example, feature extraction can be a form of spectral analysis performed on speech frames of tens of milliseconds.

[0019] Classifier 108 can take the features extracted from the audio data and generate a speaker verification score 112 based on those features. For example, classifier 108 can take all of the audio data and calculate the probability that the speech corresponds to a registered speaker model 110. In some examples, classifier 108 can use a speaker model 110 to calculate the speaker verification score 112. For example, there may be a separate speaker model for each speaker to be detected using classifier 108. The output of pipeline 100 above is a numeric speaker verification score 112. For example, a higher speaker verification score may indicate a greater likelihood of a match with a speaker model 110. In some instances, to accept or reject a speaker, a threshold value for the probability can be set.
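The processing stages of paragraphs [0017] to [0019] can be sketched in miniature. Every implementation below is a hypothetical stand-in (mean removal for preprocessing, per-frame energy for feature extraction, an inverse-distance match score for the classifier), not the techniques the disclosure actually uses:

```python
# Illustrative stand-ins for the stages of pipeline 100 (toy math throughout).

def preprocess(samples):
    """Preprocessor 104 stand-in: remove the DC offset of the signal."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def extract_features(samples, frame_size=4):
    """Feature extractor 106 stand-in: per-frame energy instead of real
    spectral analysis on frames of tens of milliseconds."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(s * s for s in frame) / len(frame) for frame in frames]

def classify(features, speaker_model):
    """Classifier 108 stand-in: inverse-distance score against a speaker
    model 110; a higher score indicates a more likely match."""
    distance = sum(abs(f - m) for f, m in zip(features, speaker_model))
    return 1.0 / (1.0 + distance)

samples = [0.1, 0.4, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2]
features = extract_features(preprocess(samples))
# Scoring features against an identical model yields the maximum score.
print(classify(features, speaker_model=features))  # prints 1.0
```

A real system would replace each stand-in with the components the disclosure names (beamforming, spectral features, a trained classifier), but the stage-to-stage data flow is the same.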
In some examples, the threshold can be defined based on a trade-off between the false acceptance rate and the false rejection rate, as described in more detail with respect to figure 4 below. In some examples, a verification score that incorporates both the speaker verification score and a signal quality measurement score may be generated. For example, the verification score may incorporate the speaker verification score output by the speaker ID system and the proximity of that score to the scores of any other speakers registered in the same system. The verification score can also incorporate signal quality measurements taken on the input signal that correlate with the expected performance of the speaker ID system. For example, signal quality measurements can include a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, etc.

[0020] In some examples, the verification score can then be compared to one or more thresholds. For example, an upper and a lower threshold for the verification score can be set. Speech with a verification score below the lower threshold can be detected as originating from an unknown user, while speech with a verification score above the upper threshold can be detected as originating from a known user. In some examples, the verification score can fall between the lower threshold and the upper threshold.

[0021] In some examples, a speech assistant may include a dialog engine that controls the flow of interaction with one or more users. For example, the flow of the dialogue may depend on confidence in the output of the speaker verification system. In some instances, when there is little confidence in the output of the speaker verification system, an additional dialog can be generated to obtain more spoken input from the user on which to base the speaker verification decision.
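As one illustrative way to fold the signal quality measurements of paragraph [0019] into a single verification score, the speaker score can be discounted by quality factors. The 20 dB SNR target and 3 s duration target below are assumptions made for the sketch, not values taken from the disclosure:

```python
def verification_score(speaker_score, snr_db, duration_s):
    """Discount the speaker verification score by signal quality factors.
    The 20 dB SNR and 3 s duration saturation points are illustrative
    assumptions, not values from the disclosure."""
    snr_quality = max(0.0, min(1.0, snr_db / 20.0))
    duration_quality = max(0.0, min(1.0, duration_s / 3.0))
    return speaker_score * snr_quality * duration_quality

# Clean, long utterance: the quality factors saturate at 1.0.
print(verification_score(0.9, snr_db=25.0, duration_s=4.0))  # prints 0.9
# Noisy, short utterance: same speaker score, much lower overall confidence.
print(verification_score(0.9, snr_db=10.0, duration_s=1.5))  # prints 0.225
```

The multiplicative discount is only one combination strategy; as the text notes, statistical or learned combinations are equally possible.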
For example, the system can generate additional dialog until the system is confident in the score, while not introducing a significant additional burden for the user. As an example, additional dialogue can be generated when the verification score is above the lower threshold but below the upper threshold, and can continue to be generated until the verification score exceeds the upper threshold. In some examples, the dialog flow can be designed to sound natural to the user, so that the user will not know that an additional verification of his or her voice is being performed in the background.

[0022] A system using the techniques described here can thus adjust the flow of the interaction dialog with the speech assistant depending on the confidence in the speaker verification system. In some instances, if there is high confidence in the speaker verification decision, the system can proceed to immediately detect a known or unknown user based only on the user's first utterance. In contrast, when there is little confidence in the speaker verification decision, the system can add additional dialog turns in order to capture more user speech on which to base its speaker verification/ID decision. In some instances, additional user input speech, received as additional audio data, can be used in a number of ways to improve confidence in the user's identity. For example, the system can generate an updated verification score using only the speech from the additional dialogue turns. In some examples, the system may combine the scores from the original audio data and the additional audio data. The speaker verification confidence or score can improve with additional speech audio data for several reasons. For example, there is more speech from which to generate speaker verification scores, and text-independent systems generally perform better with longer input speech.
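Combining the scores from the original and additional audio data, as paragraph [0022] suggests, might be sketched as a duration-weighted average. The duration weighting is an illustrative choice, not the disclosed method:

```python
def updated_verification_score(prev_score, prev_duration_s, new_score, new_duration_s):
    """Combine the score from the original audio data with the score from the
    additional dialog turns, weighting each by the amount of speech it was
    computed from (an illustrative weighting)."""
    total = prev_duration_s + new_duration_s
    return (prev_score * prev_duration_s + new_score * new_duration_s) / total

# 2 s of uncertain initial speech followed by 6 s of cleaner additional speech.
print(updated_verification_score(0.55, 2.0, 0.85, 6.0))  # -> approximately 0.775
```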
In addition, in some examples, a transient external noise may have occurred during the original audio data, while the additional audio data has a better signal-to-noise ratio (SNR), thus improving the resulting verification score.

[0023] In an example of a domestic setting, all family members can be users of the speech assistant and can thus be registered in the speaker verification system. Although the number of registered users in this scenario may be small, their voices can be similar because they all come from the same family. The speaker verification system may therefore be prone to confusion due to the similarity of the voices. Therefore, an adaptive system can be used to obtain additional speech through generated dialogue to improve user detection in a more natural way.

[0024] In some examples, a speaker ID system may produce one or more scores that provide a measure of confidence in the speaker ID. In some examples, the system can detect the identity of the closest-matching speaker among the set of registered people, along with the speaker verification score or utterance probability for that speaker model. In some instances, the system may use the score for the second-closest matching speaker model. For example, the score for the second-closest matching speaker model can be compared to the score for the best speaker model, providing an alternative confidence measure. In some instances, the system may use the scores of all registered speakers. In some examples, the system may use a score from a model that represents a common user voice.

[0025] The diagram in figure 1 is not intended to indicate that the example system 100 must include all of the components shown in figure 1. Instead, the example system 100 can be implemented using fewer or additional components not shown in figure 1 (for example, additional models, processing steps, speaker verification score outputs, etc.).
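The closest-match and second-closest-match confidence measures of paragraph [0024] might be sketched as follows. The names reuse the family example of paragraph [0023], and the gap-based confidence is one illustrative interpretation of comparing the best and second-best scores:

```python
def identify_speaker(scores_by_speaker):
    """Return the closest-matching registered speaker, its score, and the gap
    to the second-closest model as an alternative confidence measure."""
    ranked = sorted(scores_by_speaker.items(), key=lambda kv: kv[1], reverse=True)
    (best_name, best_score), (_, runner_up) = ranked[0], ranked[1]
    return best_name, best_score, best_score - runner_up

# Similar family voices produce a small gap, signalling low confidence.
scores = {"Brian": 0.74, "Liz": 0.70, "Alice": 0.35}
name, score, gap = identify_speaker(scores)
print(name, round(gap, 2))  # prints: Brian 0.04
```

A small gap between the top two models is exactly the situation in which the adaptive system would ask for more speech rather than commit to an identity.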
In some instances, system 100 may not include preprocessor 104. For example, feature extractor 106 may directly process audio data received from the speech receiver 102. In another example, the feature extractor can be eliminated if the classifier is a deep neural network that takes raw speech data as input.

[0026] Figure 2 is a detailed flow chart illustrating an example of a process for generating a dialogue based on a speaker verification score. The example process is generally indicated by reference number 200 and can be implemented on system 100 above or on computer device 600 below. For example, the process can be implemented using processor 602 of computer device 600 of figure 6 below.

[0027] In block 202, a processor receives audio data including speech. For example, the audio data can be received from one or more microphones. In some examples, the speech may include a key phrase and a command. For example, the key phrase may be a predetermined wake-up phrase.

[0028] In decision diamond 204, the processor determines whether a key phrase is detected in the audio data. For example, the processor may be listening continuously to detect when a specific wake-up key phrase is spoken. An example key phrase could be: "Hello computer." In some examples, a key phrase detection algorithm can also provide the start and end points of the speech waveform so that text-dependent speaker verification (SV TD) can be performed on the segment. In some examples, if the key phrase is not detected, the process can continue at block 206. If the key phrase is detected, the process can continue at blocks 208 and 210.

[0029] In block 206, the processor may stop and wait for additional audio data to be received in block 202. In some instances, the processor may sleep, enter a sleep or stand-by mode, or perform other tasks.
For example, the device may do nothing and return to the default mode.

[0030] In block 208, the processor calculates quality measurements of the incoming speech signal. For example, the processor can measure the quality of the audio input signal corresponding to the audio data. In some examples, the processor can calculate several signal quality measurements that correlate with the ability to obtain a speaker ID. For example, the measurements can include an absolute noise level, an input speech signal level, a signal-to-noise ratio (SNR), an amount of reverberation, and the length of the command phrase portion of the input audio data.

[0031] In block 210, the processor generates text-dependent speaker verification (SV TD) scores and text-independent speaker verification (SV TI) scores. For example, the processor can use the key phrase portion of the received audio data for scoring with SV TD. Similarly, the processor can use the spoken command portion of the audio data with SV TI. SV TD can have much lower error rates than SV TI for very short utterances, so the two segments of the audio data can be separated and processed separately. In some examples, the two resulting scores can be combined to obtain a more reliable classification. In some examples, the TD portion may be given greater weight when combining the scores. In some instances, the combined SV score can be computed for all speakers registered on the device. In some cases, the TD algorithm may use speech segments from both the key phrase and command portions, to increase the amount of acoustic data that is provided to the classifier. In addition, as shown in block 212, one or more speaker models can be received in block 210. For example, a speaker model can be received for each speaker to be potentially detected.
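The TD-weighted score combination of paragraph [0031] might be sketched as a weighted average. The 0.7 weight is an illustrative assumption reflecting TD's lower error rate on short utterances, not a value from the disclosure:

```python
def fuse_sv_scores(td_score, ti_score, td_weight=0.7):
    """Weighted average of the SV TD and SV TI scores, with the TD portion
    weighted more heavily (the 0.7 weight is an illustrative assumption)."""
    return td_weight * td_score + (1.0 - td_weight) * ti_score

# Strong match on the key phrase, weaker match on the spoken command.
print(fuse_sv_scores(td_score=0.9, ti_score=0.6))  # -> approximately 0.81
```

As paragraph [0047] later notes, the weight itself could be driven by SNR, duration, or phonetic richness rather than fixed.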
[0032] In block 214, the processor combines the SV score and the signal quality measurements to generate a verification score. For example, the speaker verification scores or the measurements of incoming speech signal quality can be used separately or combined to form a measure of total confidence in the identity of the person from the utterance. In some instances, the verification score may be a score in which a high value indicates a good match and a low value indicates a poor match. Alternatively, in some examples, the verification score may be a probability. The combined verification score can be obtained using any suitable technique. For example, the processor can generate the verification score using statistical measurements, empirical measurements, or machine learning, among other possible techniques for combining scores.

[0033] In decision diamond 216, the processor compares the verification score to one or more thresholds to determine whether the verification score exceeds them. In some instances, the thresholds may include a lower threshold and an upper threshold. For example, if the upper threshold is exceeded, the process may continue at block 218. If the upper threshold is not exceeded but the lower threshold is exceeded, the process may continue at block 220. If the lower threshold is not exceeded, the process may continue at block 224. For example, for the processor to decide whether or not a particular user is someone outside the closed circle of registered users, the verification score must be compared to one or more of the thresholds. In some examples, a threshold can be set according to a target false acceptance rate (FAR) and a target false rejection rate (FRR) of an application. As used herein, FAR refers to the rate at which users are falsely detected as a known user. FRR refers to the rate at which users are falsely detected as unknown users.
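Decision diamond 216 and blocks 218, 220, and 224 amount to three-way routing on the verification score, which can be sketched as follows. The threshold values are placeholders; real thresholds would be tuned to an application's target FAR and FRR:

```python
def route_dialog(verification_score, lower=0.4, upper=0.8):
    """Decision diamond 216 as three-way routing (placeholder thresholds)."""
    if verification_score > upper:
        return "known_user"            # block 218: proceed as identified user
    if verification_score > lower:
        return "request_more_speech"   # block 220: add dialog turns
    return "unknown_user"              # block 224: proceed as guest

print(route_dialog(0.9))  # prints known_user
print(route_dialog(0.6))  # prints request_more_speech
print(route_dialog(0.2))  # prints unknown_user
```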
In some instances, the thresholds may be different for different applications. For example, some applications can tolerate a higher FAR in exchange for a lower FRR, and vice versa.

[0034] In block 218, the processor continues to generate a dialog on the assumption that a user has been identified. In some examples, the processor may generate a dialog based on the detected user. For example, the processor can detect a high degree of confidence that the person has been identified and can proceed to generate a dialogue assuming that the person's identity is known. For example, generating a dialog can include generating statements or questions that correspond to the known user. In some examples, the processor can access a database with one or more stored preferences or other saved data associated with the known user to generate the dialog. In some instances, the processor may also take one or more actions in response to additional audio data received from the user. For example, actions can be taken in response to the receipt of one or more commands from the known user.

[0035] In block 220, the processor generates an additional dialog to determine the identity of the person. The processor can thus generate a dialog that does not assume that any user has been identified. For example, the processor can generate a dialogue asking about the person's day, or another generalized dialogue. In some examples, the user may provide additional speech that the processor can use to increase the verification score above the upper threshold. For example, if the verification score is between a lower threshold T1 and an upper threshold T2, this may indicate that there is some uncertainty about the user's identity. Therefore, the processor can proceed to add more dialogue turns to obtain more input from the user on which to make a more reliable determination. In some instances, this can occur for several reasons.
For example, a registered speaker may have spoken under conditions that vary from the registration conditions, thus producing a poor match. For example, changing conditions can include user illness, user mood, background noise, room acoustics, different microphones, etc. The resulting error rates due to changing conditions may be too large for some applications. Furthermore, rejecting a speaker too early can lead to user frustration. Thus, the processor can generate additional dialog to collect more speech from the person on which to make a more informed determination as to whether the user is a registered user or not. In some instances, the processor can determine the user's identity by explicitly asking the user if he or she is the closest-matching person. In some examples, depending on the security level of a system, the processor may also ask the user to answer a challenge question or provide a secret password. In some instances, the processor may engage the user in conversation based on the context of the current dialog. In some instances, the processor may inquire about additional relevant details of a user order.

[0036] In decision diamond 222, the processor determines whether a matching user has been found. For example, a matching user can be found in response to detecting that the verification score of block 214 exceeds the upper threshold for a particular speaker model associated with a user. In some examples, if a matching user is found, the process can continue at block 218. Conversely, if a matching user is not found, the process can continue at block 224.

[0037] In block 224, the processor generates a dialog assuming an unknown user. For example, a poor match may have been obtained and the processor may generate a dialogue while continuing to assume that the person's identity is not known. In some examples, one or more features may be limited.
For example, if after generating an additional dialog the user's identity still does not match one of the registered speakers, the processor can continue the interaction as for a guest user. Access to private content will be blocked and there will be no user-specific recommendations.

[0038] In an example of a multi-user dialog, there may be three users who are pre-registered and one user who is not registered. For example, a parent might ask: "Hi computer, what's on TV tonight?" The system can answer: "Hi Brian, there is a great action movie at 7pm that you might like." A mother may also ask: "Hi computer, what's on TV tonight?" The system can answer: "Hi Liz, your favorite fashion show starts at 8." Likewise, a 4-year-old child can also ask: "Hi computer, what's on TV tonight?" The system can answer: "Hi Alice, Super Happy Fun Time starts in 10 minutes." A new, unregistered user may request: "Hello computer, set an alarm for 4:00." In this case, however, the system can respond: "Hi, I'm sorry, but only family members can program alarms." Assuming that both TD and TI models are registered, the system can use both parts of the speech to determine the person's identity. For example, SV TD can be applied to detected key phrases and SV TI can be applied to commands to detect each speaker. In some examples, both techniques can be used to generate a single speaker verification score to detect the speaker and to determine whether or not the speaker is registered in the system. Thus, specific features can be customized or limited to users who are registered in the system.

[0039] As another example of dialogue, a registered user can start by saying: "Hello computer, what's on TV tonight?" The processor can process this phrase using the techniques described above. However, the sentence may receive a verification score between the lower threshold T1 and the upper threshold T2.
The processor can thus request additional speech from the person to increase the verification or confidence score. For example, the processor can generate the dialog: "Let me check the listings for you. So, please tell me about your day while I look for it." The user can respond with the additional speech: "I had a stressful day at work, preparing a big presentation in front of a huge audience. We are really pressed for time. I want to sit and relax." The processor can receive this additional speech as audio data, which can result in a higher verification score. For example, the verification score can now exceed the upper threshold T2 for a speaker model associated with a user named Dan. The processor can then generate dialogue assuming an identified user. For example, the processor may generate the dialogue: "I'm sorry to hear that you're stressed out, Dan. To make you feel better, you'll be interested to see Game 7 of the MBH finals tonight on the XZY channel starting at 7pm." In some instances, the processor may have access to private content, such as favorite settings, music, television shows, sports teams, etc. For example, the processor can access private content associated with the identified user in response to detecting an identified user.

[0040] This process flow diagram is not intended to indicate that the blocks of the example process 200 must be executed in any particular order, or that all of the blocks must be included in every case. For example, the key phrase detection decision diamond 204 may be optional. In some examples, process 200 may continue from block 202 directly to blocks 208 and 210. In addition, any number of additional blocks not shown may be included in the example process 200, depending on the details of the particular implementation.

[0041] Figure 3 is a block diagram illustrating the generation of a speaker verification score from example audio data received from a speaker.
The example of generating the speaker verification score is generally indicated by reference number 300 and can be implemented on the computer device 600 below. For example, speaker verification score 300 can be generated using processing pipeline 100 of figure 1, processor 602 and speaker marker 634 of computer device 600 of figure 6 below, or speaker marker module 710 of the computer-readable media 700 of figure 7 below.

[0042] Figure 3 shows an example of audio data including speech 302 received from a user. For example, the speech can include the phrase: "Hello computer, what's on TV tonight?" The phrase portion "Hello computer" can be detected as a wake-up key phrase 304, and the phrase portion "what's on TV tonight" can be detected as a command 306 for automatic speech recognition (ASR).

[0043] In block 308, the processor detects whether the wake-up key phrase 304 "Hello computer" is detected as a key phrase. Several different techniques can be used for key phrase detection. In some examples, a very reduced vocabulary automatic speech recognition algorithm (one or more words) is used to detect this key phrase. For example, one or more words can be used to detect the key phrase. In some examples, spectral features can be used in the front end, followed by a deep neural network (DNN) acoustic model with a hidden Markov model (HMM) as the key phrase model. In some instances, the DNN function can be expanded to obviate the need for the spectral features and the HMM. For example, an end-to-end DNN classifier can be used to detect key phrases directly from raw speech. As used herein, the term DNN is intended to include many alternative forms of neural network types and topologies, such as a convolutional neural network (CNN), a long short-term memory (LSTM) network, a recurrent neural network (RNN), fully connected layers, etc., or any combination thereof.
[0044] In block 310, the processor performs text-dependent speaker verification (SV TD) on key phrase 304. In text-dependent (TD) verification, the words used to register a user and to test a user can be the same. Thus, with SV TD it may be possible to use pass phrases to achieve an equal error rate (EER) below 1% under ideal conditions. For example, short pass phrases can have a length of 1 to 5 seconds, such as "hello computer". Registration may include only a few repetitions of the same phrase by the user to be registered. Thus, SV TD can be used to quickly authenticate a user with very little effort and registration time.

[0045] In block 312, the processor processes command 306 using voice activity detection 312. For example, voice activity detection (VAD) in its simplest form can be an energy detector. Voice can be detected when the energy in a segment exceeds the background noise level by some empirically determined threshold. In some examples, a more sophisticated VAD could use a DNN to classify whether an audio segment is speech or some other type of noise. In some examples, automatic speech recognition can be used to detect significant words or phonemes corresponding to the user's language.

[0046] In block 314, the processor performs text-independent speaker verification (SV TI). For example, SV TI may not place any limitations on the registration and test vocabulary, which allows SV TI to recognize speakers during natural conversational speech. In some instances, SV TI may take more than a minute of speech to register a user, and may use longer speech test segments to achieve an EER comparable to SV TD. For example, the command "what's on TV tonight" is twice as long as the key phrase "hello computer".

[0047] In block 316, the processor performs score fusion to generate a single speaker verification score. In some examples, the processor may combine the SV TI and SV TD scores using any suitable technique to generate a combined SV score.
For example, a simple average or a weighted average can be used. In the case of a weighted average, the weighting can be determined by factors such as SNR, duration, the phonetic richness of the segments, or a combination of these.

[0048] The diagram in figure 3 is not intended to indicate that the example of generating speaker verification score 300 must include all of the components shown in figure 3. Instead, the example of generating speaker verification score 300 can be implemented using fewer components or additional components not shown in figure 3 (for example, additional key phrases, commands, speech, scoring components, etc.).

[0049] Figure 4 is a graph illustrating an example of a detection error trade-off. The example detection error trade-off is generally indicated by reference number 400 and can be implemented in the computer device 600 below. For example, detection error trade-off 400 can be used by user detector 640 of computer device 600 of figure 6 below, or by user detector module 716 of computer-readable media 700 of figure 7 below. For example, the detection error trade-off can be used to define one or more thresholds for detecting a speaker.

[0050] Figure 4 shows error percentage rates 402 and 404, and an equal error rate line 406 indicating an equal false acceptance rate (FAR) and false rejection rate (FRR). The detection error trade-off curve 408 indicates all operating zones of an example system that can be reached by choosing different values for a threshold. For example, setting a high threshold value can lead to a low false acceptance rate (FAR), but it can increase the false rejection rate (FRR). The opposite may be true with a lower threshold value. Thus, the intersection 410 of the detection error trade-off curve
77/119 20/46 detection 408 and the equal error rate line 406 can be used to determine a threshold that can provide both a low FAR and a low FRR. For example, the FAR and FRR at intersection 410 of the trace line of detection error 408 and the line of equal error rate 406 are shown as 1%. [0051] The diagram in figure 4 is not intended to indicate that the example of error detection commitment 400 should include all the components shown in figure 4. Instead, the example of error detection commitment 400 can be implemented using minus components or additional components not shown in figure 4 (for example, additional dimensions, detection error plot lines, etc.). [0052] Figure 5 is a flow chart illustrating a method for generating a dialogue based on a speaker check score. The method example is generally indicated by reference number 500 and can be implemented using at least partially the processing pipeline 100 of figure 1 above, processor 602 of the computing device 600 of figure 6 below, or the computer-readable means 700 of figure 7 below. [0053] In block 502, a processor receives audio data including speech. For example, the audio data can be an audio signal. In some instances, the speech may include a key phrase, a command, or both. [0054] In block 504, the processor detects a key phrase in the audio data. For example, the key phrase can be a wake-up key phrase. In some examples, the key phrase may have been recorded for each user who is registered. [0055] In block 506, the processor generates a verification score based on the audio data. In some instances, the processor may generate the verification score in response to the Petition 870180138577, of 10/08/2018, p. 78/119 21/46 detection of the key phrase. For example, the processor can generate a speaker check score based on the audio data and a speaker template and generate the check score based on the speaker check score. 
In some examples, the processor can calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and combine the text-dependent score and the text-independent score to generate a speaker verification score. For example, the processor can then generate the verification score based on the speaker verification score. In some examples, the processor may generate a signal quality score based on the audio data and generate the verification score based on the signal quality score. For example, the signal quality score can be generated based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof. In some instances, the processor may generate the verification score based on the signal quality score, the speaker verification score, or both.

[0056] At decision diamond 508, the processor determines whether the verification score exceeds one or more thresholds. For example, the thresholds can include a lower threshold and an upper threshold. In some examples, the one or more thresholds can be set based on an application. For example, the one or more thresholds can be set at least in part based on an application's target false acceptance rate (FAR) and false rejection rate (FRR). In some examples, if the processor detects that the verification score does not exceed the lower threshold, then method 500 may continue at block 510. In some examples, if the processor detects that the verification score exceeds the lower threshold but does not exceed the upper threshold, then method 500 may continue at block 512. In some instances, if the processor detects that the verification score exceeds both thresholds, then method 500 may continue at block 514.
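The three-way branch at decision diamond 508 can be sketched as follows; the threshold values and return labels are illustrative assumptions, since in practice the thresholds would be derived from the application's target FAR and FRR:

```python
LOWER_THRESHOLD = 0.4  # illustrative; in practice set from the target FAR/FRR
UPPER_THRESHOLD = 0.8  # illustrative

def route_verification(score):
    """Map a verification score to the next block of method 500."""
    if score <= LOWER_THRESHOLD:
        return "block_510_unknown_user"       # deny restricted services
    if score <= UPPER_THRESHOLD:
        return "block_512_request_more_audio" # generate a dialog
    return "block_514_respond_to_known_user"  # respond to the known user
```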
[0057] In block 510, the processor detects an unknown user in response to detecting that the verification score does not exceed the lower threshold. In some instances, the processor may generate a dialog denying access to restricted services in response to detecting an unknown user. In some instances, the processor may generate a dialog or provide one or more unrestricted services in response to detecting the unknown user.

[0058] In block 512, the processor generates a dialog requesting additional audio data to be used to generate an updated verification score, in response to detecting that the verification score exceeds the lower threshold but does not exceed the upper threshold.

[0059] In block 514, the processor generates a response to the audio data based on the detected known user. For example, the processor may detect a known user in response to detecting that the verification score exceeds the upper threshold.

[0060] This process flow diagram is not intended to indicate that the blocks of the example method 500 are to be executed in any particular order, or that all of the blocks are to be included in every case. For example, method 500 can be performed without detecting the key phrase in the audio data in block 504. In addition, any number of additional blocks not shown can be included in the example method 500, depending on the details of the particular implementation. For example, method 500 may also include preprocessing the audio data to eliminate noise from the audio data. In some examples, method 500 may include extracting features from the audio data. For example, the speaker verification score can be generated based on the extracted features.

[0061] Referring now to figure 6, a block diagram is shown illustrating an example of a computing device that can generate a dialog based on a speaker verification score.
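The feature extraction mentioned in connection with method 500 might look like the following sketch; the frame sizes and the log-energy feature are assumptions standing in for whatever features an implementation actually uses:

```python
import math

def extract_features(samples, frame_len=400, hop=160):
    """Split audio samples into overlapping frames and compute one
    log-energy feature per frame (a stand-in for richer features
    such as mel-frequency coefficients)."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        features.append(math.log(energy + 1e-10))
    return features
```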
The computing device 600 can be, for example, a laptop computer, desktop computer, tablet computer, mobile device, or wearable device, among others. In some instances, the computing device 600 may be a virtual assistant device. The computing device 600 may include a central processing unit (CPU) 602 that is configured to execute stored instructions, as well as a memory device 604 that stores instructions that are executable by the CPU 602. The CPU 602 can be coupled to the memory device 604 by a bus 606. In addition, the CPU 602 can be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. In addition, the computing device 600 may include more than one CPU 602. In some instances, the CPU 602 may be a system on a chip (SoC) with a multi-core processor architecture. In some instances, the CPU 602 may be a specialized digital signal processor (DSP) used for image processing. The memory device 604 may include random access memory (RAM), read-only memory (ROM), flash memory, or any other suitable memory systems. For example, the memory device 604 may include dynamic random access memory (DRAM).

[0062] The memory device 604 may include random access memory (RAM), read-only memory (ROM), flash memory, or any other suitable memory systems. For example, the memory device 604 may include dynamic random access memory (DRAM).

[0063] The computing device 600 may also include a graphics processing unit (GPU) 608. As shown, the CPU 602 can be coupled via the bus 606 to the GPU 608. The GPU 608 can be configured to perform any number of graphics operations on the computing device 600. For example, the GPU 608 can be configured to render or manipulate graphic images, graphic frames, videos, or the like, to be displayed to a user of the computing device 600.

[0064] The memory device 604 may include random access memory (RAM), read-only memory (ROM), flash memory, or any other suitable memory systems.
For example, the memory device 604 may include dynamic random access memory (DRAM). The memory device 604 can include device drivers 610 that are configured to execute instructions to generate a dialog based on a speaker verification score. The device drivers 610 can be software, an application program, application code, or the like.

[0065] The CPU 602 can also be connected via the bus 606 to an input/output (I/O) device interface 612 configured to connect the computing device 600 to one or more I/O devices 614. The I/O devices 614 may include, for example, a keyboard and a pointing device, where the pointing device may include a touchpad or touch screen, among others. The I/O devices 614 can be built-in components of the computing device 600, or can be devices that are externally connected to the computing device 600. In some examples, the memory 604 can be communicatively coupled to the I/O devices 614 through direct memory access (DMA).

[0066] The CPU 602 can also be connected via the bus 606 to a display interface 616 configured to connect the computing device 600 to a display device 618. The display device 618 may include a display screen that is a built-in component of the computing device 600. The display device 618 may also include a computer monitor, television, or projector, among others, that is internally or externally connected to the computing device 600.

[0067] The computing device 600 also includes a storage device 620. The storage device 620 is a physical memory such as a hard drive, an optical drive, a USB flash drive, an array of drives, a solid-state drive, or any combination thereof. The storage device 620 may also include remote storage drives.

[0068] The computing device 600 can also include a network interface controller (NIC) 622. The NIC 622 can be configured to connect the computing device 600 via the bus 606 to a network 624. The network 624 can be a wide area network (WAN), a local area network (LAN), or the Internet, among others.
In some examples, the device can communicate with other devices using wireless technology. For example, the device can communicate with other devices over a wireless local area network connection. In some instances, the device can connect and communicate with other devices using Bluetooth® or similar technology.

[0069] The computing device 600 also includes a microphone 626. For example, the microphone 626 can be a single microphone or a microphone array.

[0070] The computing device 600 further includes an adaptive dialog speaker recognition device 628. For example, the adaptive dialog speaker recognition device 628 can be used to generate a dialog to receive additional audio data used to detect a speaker. The adaptive dialog speaker recognition device 628 may include an audio receiver 630, a key phrase detector 632, a speaker scorer 634, a signal quality scorer 636, a verification score generator 638, a user detector 640, and a dialog generator 642. In some examples, each of the components 630-642 of the adaptive dialog speaker recognition device 628 can be a microcontroller, an integrated processor, or a software module. The audio receiver 630 can receive audio data including speech. In some instances, the speech may include a key phrase, a command, or both. The key phrase detector 632 can detect a key phrase in the audio data. The speaker scorer 634 can generate a speaker verification score based on the audio data and a speaker model. For example, the speaker scorer 634 can calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and combine the text-dependent score and the text-independent score to generate the speaker verification score. The signal quality scorer 636 can generate a signal quality score based on the audio data. For example, the signal quality score can be based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof. The verification score generator 638 can generate a verification score based on the audio data. For example, the verification score generator 638 can generate the verification score in response to the detection of the key phrase. In some examples, the verification score generator may generate an updated verification score based on additional audio data. For example, the additional audio data can be received in response to the dialog generated by the dialog generator 642 below. The user detector 640 can detect that the verification score exceeds a lower threshold but does not exceed an upper threshold. In some instances, the user detector 640 may detect an unknown user in response to receiving additional audio data from the user and detecting that the updated verification score exceeds a lower threshold but does not exceed an upper threshold. In some instances, the user detector 640 may detect a known user in response to detecting that the verification score exceeds the upper threshold. In some instances, the user detector 640 may detect an unknown user in response to detecting that the verification score does not exceed the lower threshold. The dialog generator 642 can generate a dialog requesting additional audio data to be used to generate an updated verification score, in response to detecting that the verification score exceeds a lower threshold but does not exceed an upper threshold. In some examples, the dialog generator 642 may generate a response to the audio data based on the detected known user. For example, the response may include personalized information, such as favorite movies, games, news, shows, etc. In some examples, the dialog generator 642 can generate a response based on a detected unknown user. For example, the response may be a message denying access to restricted services.
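As an illustration only, the signal quality scorer 636 might combine the listed factors into a single score as follows; the normalization ranges and weights here are assumptions, not values from the disclosure:

```python
def signal_quality_score(snr_db, duration_s, reverb_rt60_s):
    """Map SNR, input duration, and a reverberation measurement to a
    quality score in [0, 1]; the ranges and weights are illustrative."""
    snr_term = min(max(snr_db / 30.0, 0.0), 1.0)            # 0..30 dB -> 0..1
    dur_term = min(max(duration_s / 5.0, 0.0), 1.0)         # saturates at 5 s
    reverb_term = 1.0 - min(max(reverb_rt60_s, 0.0), 1.0)   # more reverb -> lower
    return 0.5 * snr_term + 0.3 * dur_term + 0.2 * reverb_term
```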
[0071] The block diagram of figure 6 is not intended to indicate that the computing device 600 is to include all of the components shown in figure 6. Rather, the computing device 600 may include fewer components, or additional components not shown in figure 6, such as additional buffers, additional processors, and the like. The computing device 600 may include any number of additional components not shown in figure 6, depending on the details of the particular implementation. For example, the computing device 600 may also include a preprocessor for preprocessing the audio data to eliminate noise. For example, the preprocessor can preprocess the audio data using any of the techniques described in figure 1 above. In some examples, the computing device 600 may also include a feature extractor to extract features from the audio data. For example, the speaker scorer 634 can generate speaker verification scores based on the extracted features. In addition, any of the features of the audio receiver 630, the key phrase detector 632, the speaker scorer 634, the signal quality scorer 636, the verification score generator 638, the user detector 640, and the dialog generator 642 can be fully or partially implemented in hardware and/or in the processor 602. For example, the functionality can be implemented with an application-specific integrated circuit, in logic implemented in the processor 602, or in any other device. In addition, any of the features of the CPU 602 can be fully or partially implemented in hardware and/or in a processor. For example, the functionality of the adaptive dialog speaker recognition device 628 can be implemented with an application-specific integrated circuit, in logic implemented in a processor, in logic implemented in a specialized audio processing unit, or in any other device.

[0072] Figure 7 is a block diagram showing computer-readable media 700 that store code to generate a dialog based on a speaker verification score.
The computer-readable media 700 can be accessed by a processor 702 over a computer bus 704. In addition, the computer-readable media 700 may include code configured to direct the processor 702 to perform the methods described herein. In some embodiments, the computer-readable media 700 may be non-transitory computer-readable media. In some instances, the computer-readable media 700 may be storage media.

[0073] The various software components discussed herein can be stored on one or more computer-readable media 700, as shown in figure 7. For example, an audio receiver module 706 can be configured to receive audio data including speech. A key phrase detector module 708 can be configured to detect a key phrase in the audio data. A speaker scorer module 710 can be configured to generate a speaker verification score based on the audio data and a speaker model. For example, the speaker scorer module 710 can be configured to calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and to combine the text-dependent score and the text-independent score to generate a speaker verification score. In some instances, the speaker scorer module 710 can be configured to generate the speaker verification score in response to the detection of the key phrase in the audio data. A signal quality scorer module 712 can be configured to generate a signal quality score based on the audio data. For example, the signal quality scorer module 712 can be configured to generate a signal quality score based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof. A verification score generator module 714 can be configured to generate a verification score based on the audio data in response to the detection of the key phrase.
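One way the verification score generator module 714 might combine a speaker verification score with a signal quality score is to attenuate the speaker score when the audio quality is poor; the gating formula below is an assumption, since the disclosure leaves the combination open:

```python
def generate_verification_score(speaker_score, quality_score, floor=0.5):
    """Scale the speaker verification score by signal quality: a
    quality of 1.0 leaves the score unchanged, while a quality of 0.0
    scales it down to `floor` times its value."""
    weight = floor + (1.0 - floor) * quality_score
    return speaker_score * weight
```

Attenuating rather than zeroing the score means noisy input tends toward the "request more audio" zone between the thresholds instead of an outright rejection.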
For example, the verification score generator module 714 can be configured to generate the verification score based on the speaker verification score, the signal quality score, or both. A user detector module 716 can be configured to detect that the verification score exceeds a lower threshold but does not exceed an upper threshold. In some instances, the user detector module 716 can be configured to detect an unknown user in response to receiving additional audio data from the user and detecting that the updated verification score exceeds a lower threshold but does not exceed an upper threshold. For example, the verification score generator module 714 can be configured to generate an updated verification score based on the additional audio data. In some examples, the user detector module 716 can be configured to detect a known user in response to detecting that the verification score exceeds the upper threshold, and to generate a response to the audio data based on the detected known user. In some examples, the user detector module 716 can be configured to detect an unknown user in response to detecting that the verification score does not exceed the lower threshold. A dialog generator module 718 can be configured to generate a dialog requesting additional audio data to be used to generate an updated verification score, in response to detecting that the verification score exceeds a lower threshold but does not exceed an upper threshold. For example, the dialog may assume that the user is an unknown user. In some examples, the dialog generator module 718 can be configured to generate a dialog based on a known user. For example, the dialog may include personalized information, such as music, concerts, favorite locations, etc.

[0074] The block diagram of figure 7 is not intended to indicate that the computer-readable media 700 are to include all of the components shown in figure 7.
In addition, the computer-readable media 700 can include any number of additional components not shown in figure 7, depending on the details of the particular implementation. For example, the computer-readable media 700 may also include a preprocessor module to preprocess the audio data to eliminate noise from the audio data. In some instances, the computer-readable media 700 may include a feature extractor module to extract features from the audio data. For example, the speaker scorer module 710 can be configured to generate speaker verification scores based on the extracted features. In some instances, the computer-readable media 700 may include a natural language understanding (NLU) module to perform one or more actions. For example, the NLU module can perform restricted actions in response to detecting that the user is a known user. In some instances, the NLU module may return an access-denied message to the dialog generator module 718 in response to detecting that an unknown user is attempting to request a restricted action. For example, restricted actions may include accessing functionality of one or more smart devices.

EXAMPLES

[0075] Example 1 is an apparatus for generating a dialog. The apparatus includes an audio receiver to receive audio data including speech. The apparatus also includes a verification score generator to generate a verification score based on the audio data. The apparatus also includes a user detector to detect that the verification score exceeds a lower threshold but does not exceed an upper threshold. The apparatus also includes a dialog generator to generate a dialog requesting additional audio data to be used to generate an updated verification score, in response to the detection that the verification score exceeds a lower threshold but does not exceed an upper threshold.

[0076] Example 2 includes the apparatus of example 1, including or excluding optional features.
In this example, the apparatus includes a key phrase detector to detect a key phrase in the audio data. The verification score generator is to generate the verification score based on the audio data in response to the detection of the key phrase.

[0077] Example 3 includes the apparatus of any of examples 1 to 2, including or excluding optional features. In this example, the apparatus includes a speaker scorer to generate a speaker verification score based on the audio data and a speaker model. The verification score is at least in part based on the speaker verification score.

[0078] Example 4 includes the apparatus of any of examples 1 to 3, including or excluding optional features. In this example, the apparatus includes a speaker scorer to generate a speaker verification score based on the audio data and a speaker model. The speaker scorer is to calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and to combine the text-dependent score and the text-independent score to generate the speaker verification score. The verification score is at least in part based on the speaker verification score.

[0079] Example 5 includes the apparatus of any of examples 1 to 4, including or excluding optional features. In this example, the apparatus includes a signal quality scorer to generate a signal quality score based on the audio data. The verification score is at least in part based on the signal quality score.

[0080] Example 6 includes the apparatus of any of examples 1 to 5, including or excluding optional features. In this example, the apparatus includes a signal quality scorer to generate a signal quality score based on the audio data. The signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof. The verification score is at least in part based on the signal quality score.

[0081] Example 7 includes the apparatus of any of examples 1 to 6, including or excluding optional features. In this example, the apparatus includes a preprocessor to preprocess the audio data to eliminate noise.

[0082] Example 8 includes the apparatus of any of examples 1 to 7, including or excluding optional features. In this example, the apparatus includes a feature extractor to extract features from the audio data. A speaker scorer is to generate a speaker verification score based on the extracted features, and the verification score generator is to generate the verification score based on the speaker verification score.

[0083] Example 9 includes the apparatus of any of examples 1 to 8, including or excluding optional features. In this example, the user detector is to detect an unknown user in response to receiving additional audio data from the user, and to detect that the updated verification score exceeds a lower threshold but does not exceed an upper threshold. The verification score generator is to generate an updated verification score based on the additional audio data.

[0084] Example 10 includes the apparatus of any of examples 1 to 9, including or excluding optional features. In this example, the user detector is to detect a known user in response to the detection that the verification score exceeds the upper threshold, and the dialog generator is to generate a response to the audio data based on the detected known user.

[0085] Example 11 is a method for generating a dialog. The method includes receiving, via a processor, audio data including speech. The method also includes generating, via the processor, a verification score based on the audio data. The method also includes detecting, via the processor, that the verification score exceeds a lower threshold but does not exceed an upper threshold. The method also includes generating, via the processor, a dialog requesting additional audio data to be used to generate an updated verification score, in response to the detection that the verification score exceeds a lower threshold but does not exceed an upper threshold.

[0086] Example 12 includes the method of example 11, including or excluding optional features. In this example, the method includes detecting, via the processor, a key phrase in the audio data. The verification score is generated in response to the detection of the key phrase.

[0087] Example 13 includes the method of any of examples 11 to 12, including or excluding optional features. In this example, generating the verification score includes calculating a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, combining the text-dependent score and the text-independent score to generate a speaker verification score, and generating the verification score based on the speaker verification score.

[0088] Example 14 includes the method of any of examples 11 to 13, including or excluding optional features. In this example, generating the verification score includes generating a signal quality score based on the audio data and generating the verification score based on the signal quality score. The signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof.

[0089] Example 15 includes the method of any of examples 11 to 14, including or excluding optional features.
In this example, generating the verification score includes generating a signal quality score based on the audio data, generating a speaker verification score based on the audio data and a speaker model, and generating the verification score based on the score signal quality and speaker verification score. [0090] Example 16 includes the method of any of Examples 11 to 15, including or excluding optional features. In this example, the method includes pre-processing, through the processor, the audio data to eliminate noise from the audio data. [0091] Example 17 includes the method of any of Examples 11 to 16, including or excluding optional features. In this example, the method includes extracting resources from the audio data through the processor, generating a speaker check score based on the extracted resources, and generating the check score based on the speaker check score. [0092] Example 18 includes the method of any of Examples 11 to 17, including or excluding optional features. In this example, the method includes detecting, through the processor, an unknown user in response to receiving additional audio data from the user, generating an updated check score based on the additional audio data, and detecting that the updated check score exceeds one lower threshold, but does not exceed an upper threshold. [0093] Example 19 includes the method of any of Examples 11 to 18, including or excluding optional features. In this example, the method includes detecting, through the processor, a known user in response to the detection that the verification score Petition 870180138577, of 10/08/2018, p. 94/119 37/46 exceeds the upper threshold score and generate a response to audio data based on the known user detected. [0094] Example 20 includes the method of any of Examples 11 to 19, including or excluding optional features. 
In this example, the method includes detecting, through the processor, an unknown user in response to the detection that the verification score does not exceed the lower threshold score. [0095] Example 21 is at least a computer-readable means to generate a dialogue having instructions stored there that direct the processor to receive audio data including speech. The computer-readable medium includes instructions that direct the processor to generate a verification score based on the audio data. The computer-readable medium also includes instructions that direct the processor to detect that the verification score exceeds a lower threshold, but does not exceed an upper threshold. The computer-readable medium also includes instructions that direct the processor to generate a dialog to request additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold, but does not exceed an upper threshold. [0096] Example 22 includes the computer-readable medium of example 21, including or excluding optional features. In this example, the computer-readable medium includes instructions for detecting a key phrase in the audio data. The verification score must be generated in response to the detection of a key phrase. [0097] Example 23 includes the computer-readable medium of any of Examples 21 through 22, including or excluding optional features. In this example, the computer-readable medium includes instructions for calculating a text-dependent score based on Petition 870180138577, of 10/08/2018, p. 95/119 38/46 in the key phrase and an independent text score based on a command in the audio data, combine the text dependent score and the text independent score to generate a speaker check score, and generate the check score based on speaker check score. [0098] Example 24 includes the computer-readable medium of any of Examples 21 through 23, including or excluding optional features. 
In this example, the computer-readable medium includes instructions for generating a signal quality score based on the audio data, and generating the verification score based on the signal quality score. The signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverb measurement, an input duration, or any combination thereof. [0099] Example 25 includes the computer-readable medium of any of Examples 21 to 24, including or excluding optional features. In this example, the computer-readable medium includes instructions for generating a signal quality score based on the audio data, generating a speaker check score based on the audio data and a speaker model, and generating the verification score based on signal quality score and speaker check score. [0100] Example 26 includes the computer-readable medium of any of Examples 21 to 25, including or excluding optional features. In this example, the computer-readable medium includes instructions for preprocessing the audio data to eliminate noise from the audio data. [0101] Example 27 includes the computer-readable medium of any of Examples 21 to 26, including or excluding resources Petition 870180138577, of 10/08/2018, p. 96/119 39/46 optional. In this example, the computer-readable medium includes instructions for extracting resources from the audio data, generating a speaker check score based on the extracted resources, and generating the check score based on the speaker check score. [0102] Example 28 includes the computer-readable medium of any of Examples 21 to 27, including or excluding optional features. In this example, the computer-readable medium includes instructions for detecting an unknown user in response to receiving additional audio data from the user, generating an updated check score based on the additional audio data, and detecting that the updated check score exceeds one lower threshold, but does not exceed an upper threshold. 
[0103] Example 29 includes the computer-readable medium of any of examples 21 to 28, including or excluding optional features. In this example, the computer-readable medium includes instructions for detecting a known user in response to the detection that the verification score exceeds the upper threshold score and generating a response to the audio data based on the detected known user.
[0104] Example 30 includes the computer-readable medium of any of examples 21 to 29, including or excluding optional features. In this example, the computer-readable medium includes instructions for detecting an unknown user in response to the detection that the verification score does not exceed the lower threshold score.
[0105] Example 31 is a system for generating a dialogue. The system includes an audio receiver to receive audio data including speech. The system includes a verification score generator to generate a verification score based on the audio data. The system also includes a user detector to detect that the verification score exceeds a lower threshold, but does not exceed an upper threshold. The system also includes a dialogue generator to generate a dialogue to request additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold, but does not exceed an upper threshold.
[0106] Example 32 includes the system of example 31, including or excluding optional features. In this example, the system includes a key phrase detector to detect a key phrase in the audio data. The verification score generator is to generate the verification score based on the audio data in response to the detection of the key phrase.
[0107] Example 33 includes the system of any of examples 31 to 32, including or excluding optional features. In this example, the system includes a speaker scorer to generate a speaker verification score based on the audio data and a speaker model.
The verification score is at least partly based on the speaker verification score.
[0108] Example 34 includes the system of any of examples 31 to 33, including or excluding optional features. In this example, the system includes a speaker scorer to generate a speaker verification score based on the audio data and a speaker model. The speaker scorer is to calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and to combine the text-dependent score and the text-independent score to generate the speaker verification score. The verification score is at least partly based on the speaker verification score.
[0109] Example 35 includes the system of any of examples 31 to 34, including or excluding optional features. In this example, the system includes a signal quality scorer to generate a signal quality score based on the audio data. The verification score is at least partly based on the signal quality score.
[0110] Example 36 includes the system of any of examples 31 to 35, including or excluding optional features. In this example, the system includes a signal quality scorer to generate a signal quality score based on the audio data. The signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof. The verification score is at least partly based on the signal quality score.
[0111] Example 37 includes the system of any of examples 31 to 36, including or excluding optional features. In this example, the system includes a preprocessor to preprocess the audio data to remove noise.
[0112] Example 38 includes the system of any of examples 31 to 37, including or excluding optional features. In this example, the system includes a feature extractor to extract features from the audio data.
The system includes a speaker scorer to generate a speaker verification score based on the extracted features, and the verification score generator is to generate the verification score based on the speaker verification score.
[0113] Example 39 includes the system of any of examples 31 to 38, including or excluding optional features. In this example, the user detector is to detect an unknown user in response to receiving additional audio data from the user, and to detect that the updated verification score exceeds a lower threshold, but does not exceed an upper threshold. The verification score generator is to generate an updated verification score based on the additional audio data.
[0114] Example 40 includes the system of any of examples 31 to 39, including or excluding optional features. In this example, the user detector is to detect a known user in response to the detection that the verification score exceeds the upper threshold score, the dialogue generator to generate a response to the audio data based on the detected known user.
[0115] Example 41 is a system for generating a dialogue. The system includes means for receiving audio data including speech. The system also includes means for generating a verification score based on the audio data. The system also includes means for detecting that the verification score exceeds a lower threshold, but does not exceed an upper threshold. The system also includes means for generating a dialogue to request additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold, but does not exceed an upper threshold.
[0116] Example 42 includes the system of example 41, including or excluding optional features. In this example, the system includes means for detecting a key phrase in the audio data.
The means for generating the verification score is to generate the verification score based on the audio data in response to the detection of the key phrase.
[0117] Example 43 includes the system of any of examples 41 to 42, including or excluding optional features. In this example, the system includes means for generating a speaker verification score based on the audio data and a speaker model. The verification score is at least partly based on the speaker verification score.
[0118] Example 44 includes the system of any of examples 41 to 43, including or excluding optional features. In this example, the system includes means for generating a speaker verification score based on the audio data and a speaker model. The means for generating the speaker verification score is to calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and to combine the text-dependent score and the text-independent score to generate the speaker verification score. The verification score is at least partly based on the speaker verification score.
[0119] Example 45 includes the system of any of examples 41 to 44, including or excluding optional features. In this example, the system includes means for generating a signal quality score based on the audio data. The verification score is at least partly based on the signal quality score.
[0120] Example 46 includes the system of any of examples 41 to 45, including or excluding optional features. In this example, the system includes means for generating a signal quality score based on the audio data. The signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof. The verification score is at least partly based on the signal quality score.
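The score fusion described in examples 44 to 46 can be sketched as follows. The linear weighting of the two speaker scores and the use of the signal quality score as a multiplicative confidence factor are assumptions made for illustration; the disclosure states only that the verification score is "at least partly based on" each component, without fixing a combination rule.

```python
def combine_scores(text_dependent, text_independent,
                   signal_quality, w_td=0.5, w_ti=0.5):
    """Fuse a key-phrase (text-dependent) score and a command
    (text-independent) score into a speaker verification score,
    then scale it by a signal quality score in [0, 1].

    The weights w_td and w_ti are hypothetical tuning parameters.
    """
    speaker_score = w_td * text_dependent + w_ti * text_independent
    # A low-quality signal (noise, clipping, reverberation, short input)
    # reduces confidence in the resulting verification score.
    return speaker_score * signal_quality


# A clean signal passes the fused speaker score through unchanged;
# a degraded signal pulls the verification score down.
print(combine_scores(0.8, 0.6, signal_quality=1.0))
print(combine_scores(0.8, 0.6, signal_quality=0.5))
```

Under this sketch, a borderline speaker score recorded in poor conditions is more likely to fall into the middle band between the two thresholds, which is exactly the case where the system asks for additional audio rather than deciding outright.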
[0121] Example 47 includes the system of any of examples 41 to 46, including or excluding optional features. In this example, the system includes means for preprocessing the audio data to remove noise.
[0122] Example 48 includes the system of any of examples 41 to 47, including or excluding optional features. In this example, the system includes means for extracting features from the audio data. The means for generating a speaker verification score is to generate the speaker verification score based on the extracted features, and the means for generating the verification score is to generate the verification score based on the speaker verification score.
[0123] Example 49 includes the system of any of examples 41 to 48, including or excluding optional features. In this example, the means for detecting that the verification score exceeds the lower threshold, but does not exceed the upper threshold, is to detect an unknown user in response to receiving additional audio data from the user, and to detect that the updated verification score exceeds a lower threshold, but does not exceed an upper threshold. The means for generating the verification score is to generate an updated verification score based on the additional audio data.
[0124] Example 50 includes the system of any of examples 41 to 49, including or excluding optional features. In this example, the means for detecting that the verification score exceeds the lower threshold but does not exceed the upper threshold is to detect a known user in response to the detection that the verification score exceeds the upper threshold score, the means for generating the dialogue to generate a response to the audio data based on the detected known user.
[0125] Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular aspect or aspects.
If the specification states that a component, feature, structure, or characteristic "may" or "could" be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or the claims refer to "a" or "an" element, that does not mean there is only one of the element. If the specification or the claims refer to "an additional" element, that does not exclude there being more than one of the additional element.
[0126] It should be noted that, although some aspects have been described with reference to particular implementations, other implementations are possible according to some aspects. In addition, the arrangement and/or order of circuit elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular manner illustrated and described. Many other arrangements are possible according to some aspects.
[0127] In each system shown in a figure, the elements may, in some cases, each have the same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which element is designated as a first element and which is designated as a second element is arbitrary.
[0128] It should be understood that the specifics in the aforementioned examples may be used anywhere in one or more aspects. For example, all the optional features of the computing device described above may also be implemented with respect to any of the methods or the computer-readable medium described herein.
In addition, although flow diagrams and/or state diagrams may have been used herein to describe aspects, the techniques are not limited to those diagrams or to the corresponding descriptions contained herein. For example, flow need not move through each illustrated box or state, or in exactly the same order as illustrated and described herein.
[0129] The present techniques are not restricted to the particular details listed herein. Indeed, those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present techniques. Accordingly, it is the claims that follow, including any amendments thereto, that define the scope of the present techniques.
Claims (25)
[1] 1. Apparatus for generating dialogue, characterized in that it comprises: an audio receiver for receiving audio data comprising speech; a verification score generator to generate a verification score based on the audio data; a user detector to detect that the verification score exceeds a lower threshold, but does not exceed an upper threshold; and a dialogue generator to generate a dialogue to request additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold, but does not exceed an upper threshold.
[2] 2. Apparatus according to claim 1, characterized in that it comprises a key phrase detector to detect a key phrase in the audio data, wherein the verification score generator is to generate the verification score based on the audio data in response to the detection of the key phrase.
[3] 3. Apparatus according to claim 1, characterized in that it comprises a speaker scorer to generate a speaker verification score based on the audio data and a speaker model, wherein the verification score is at least partly based on the speaker verification score.
[4] 4. Apparatus according to claim 1, characterized in that it comprises a speaker scorer to generate a speaker verification score based on the audio data and a speaker model, wherein the speaker scorer is to calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and to combine the text-dependent score and the text-independent score to generate the speaker verification score, wherein the verification score is at least partly based on the speaker verification score.
[5] 5. Apparatus according to claim 1, characterized in that it comprises a signal quality scorer to generate a signal quality score based on the audio data, wherein the verification score is at least partly based on the signal quality score.
[6] 6. Apparatus according to any one of claims 1 to 5, characterized in that it comprises a signal quality scorer to generate a signal quality score based on the audio data, wherein the signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof, and wherein the verification score is at least partly based on the signal quality score.
[7] 7. Apparatus according to any one of claims 1 to 5, characterized in that it comprises a preprocessor to preprocess the audio data to remove noise.
[8] 8. Apparatus according to any one of claims 1 to 5, characterized in that it comprises a feature extractor for extracting features from the audio data, wherein a speaker scorer is to generate a speaker verification score based on the extracted features and the verification score generator is to generate the verification score based on the speaker verification score.
[9] 9. Apparatus according to any one of claims 1 to 5, characterized in that the user detector is to detect an unknown user in response to receiving additional audio data from the user, and to detect that the updated verification score exceeds a lower threshold, but does not exceed an upper threshold, wherein the verification score generator is to generate an updated verification score based on the additional audio data.
[10] 10. Apparatus according to any one of claims 1 to 5, characterized in that the user detector is to detect a known user in response to the detection that the verification score exceeds the upper threshold score, the dialogue generator to generate a response to the audio data based on the detected known user.
[11] 11.
Method for generating a dialogue, characterized in that it comprises: receiving, through a processor, audio data comprising speech; generating, through the processor, a verification score based on the audio data; detecting, through the processor, that the verification score exceeds a lower threshold, but does not exceed an upper threshold; and generating, through the processor, a dialogue to request additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold, but does not exceed an upper threshold.
[12] 12. Method according to claim 11, characterized in that it comprises detecting, through the processor, a key phrase in the audio data, wherein the verification score is generated in response to the detection of the key phrase.
[13] 13. Method according to claim 11, characterized in that generating the verification score comprises calculating a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, combining the text-dependent score and the text-independent score to generate a speaker verification score, and generating the verification score based on the speaker verification score.
[14] 14. Method according to claim 11, characterized in that generating the verification score comprises generating a signal quality score based on the audio data and generating the verification score based on the signal quality score, wherein the signal quality score is based on a background noise level, an input signal level, a signal-to-noise ratio, a reverberation measurement, an input duration, or any combination thereof.
[15] 15.
Method according to claim 11, characterized in that generating the verification score comprises generating a signal quality score based on the audio data, generating a speaker verification score based on the audio data and on a speaker model, and generating the verification score based on the signal quality score and the speaker verification score.
[16] 16. Method according to any one of claims 11 to 15, characterized in that it comprises preprocessing, through the processor, the audio data to remove noise from the audio data.
[17] 17. Method according to any one of claims 11 to 15, characterized in that it comprises extracting, through the processor, features from the audio data, generating a speaker verification score based on the extracted features, and generating the verification score based on the speaker verification score.
[18] 18. Method according to any one of claims 11 to 15, characterized in that it comprises detecting, through the processor, an unknown user in response to receiving additional audio data from the user, generating an updated verification score based on the additional audio data, and detecting that the updated verification score exceeds a lower threshold, but does not exceed an upper threshold.
[19] 19. Method according to any one of claims 11 to 15, characterized in that it comprises detecting, through the processor, a known user in response to the detection that the verification score exceeds the upper threshold score and generating a response to the audio data based on the detected known user.
[20] 20. Method according to any one of claims 11 to 15, characterized in that it comprises detecting, through the processor, an unknown user in response to the detection that the verification score does not exceed the lower threshold score.
[21] 21.
System for generating a dialogue, characterized in that it comprises: means for receiving audio data comprising speech; means for generating a verification score based on the audio data; means for detecting that the verification score exceeds a lower threshold, but does not exceed an upper threshold; and means for generating a dialogue to request additional audio data to be used to generate an updated verification score in response to the detection that the verification score exceeds a lower threshold, but does not exceed an upper threshold.
[22] 22. System according to claim 21, characterized in that it comprises means for detecting a key phrase in the audio data, wherein the means for generating the verification score is to generate the verification score based on the audio data in response to the detection of the key phrase.
[23] 23. System according to claim 21, characterized in that it comprises means for generating a speaker verification score based on the audio data and a speaker model, wherein the verification score is at least partly based on the speaker verification score.
[24] 24. System according to any one of claims 21 to 23, characterized in that it comprises means for generating a speaker verification score based on the audio data and a speaker model, wherein the means for generating the speaker verification score is to calculate a text-dependent score based on the key phrase and a text-independent score based on a command in the audio data, and to combine the text-dependent score and the text-independent score to generate the speaker verification score, wherein the verification score is at least partly based on the speaker verification score.
[25] 25.
System according to any one of claims 21 to 23, characterized in that it comprises means for generating a signal quality score based on the audio data, wherein the verification score is at least partly based on the signal quality score.
Patent family:
Publication number | Publication date
US10515640B2 | 2019-12-24
DE102018126133A1 | 2019-05-09
US20190027152A1 | 2019-01-24
Legal status:
2019-06-04 | B03A | Publication of a patent application or of a certificate of addition of invention [chapter 3.1 patent gazette]
Priority:
Application number | Filing date | Patent title
US15/806,667 | 2017-11-08 | Generating dialogue based on verification scores